This tutorial summarizes details of the Bayesian Joint Spike-and-slab
Graphical Lasso (SSJGL) algorithm (Li et. al 2019), then walks through
the steps needed to run the spikeyglass package using a
real-world metabolomics dataset.
Code throughout the document is “hidden” but if you click on “CODE” in the upper right above any R output, you can expand it to view code and (extensive) comments!
All code is available on my GitHub.
To run this tutorial on your computer please either clone the entire
repository to your computer, double click
ssjgl-tutorial.Rproj, open the .Rmd file and run the code
as is after making sure you have installed the appropriate packages in R
(below).
You can alternatively download the .csv and .Rmd files here and run the script after installing appropriate packages (below, first code chunk).
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
#devtools::install_github("mljaniczek/spikeyglass")
library(spikeyglass)
library(igraph)
library(RColorBrewer)
source("helpers.R") #custom helper functions for pretty igraph plots
theme_set(theme_bw()) #setting ggplot theme
igraph_options(annotate.plot = TRUE) # setting default to annotate plot with number of vertices and edges
# read in metabolomics data from Raji Balasubramanian, which is on the github for this tutorial
#http://github.com/mljaniczek/ssjgl-tutorial
dat <- read.csv("hapo_metabolomics_2020.csv")
See the paper “Bayesian Joint Spike-and-Slab Graphical Lasso” for a full description of the method. Here we will briefly describe the procedure.
Within the world of frequentist joint graphical model estimation, joint graphical lasso (JGL) (Guo 2011) is one of the most popular methods.
JGL falls under the family of penalized regression, and is the basis of multiple subsequent method formulations Of note, fused graphical lasso (FGL) and group graphical lasso (GGL) (Danaher 2014) are the most widely used methods which expand on JGL.
FGL encourages off-diagonal elements of the estimated precision matrix to share both patterns of sparsity and similar values of non-zero elements, while GGL only encourages similar sparsity patterns across groups.
However, these methods are extremely sensitive to choice of penalty parameters, and require methods like cross validation for tuning, which in turn requires a large enough sample size (and often large computational power) to run. Even after tuning penalty parameters using CV based on AIC, JGL methods tend to favor large models.
Bayesian joint graphical model estimation can be designed to overcome some of these difficulties, encourage sparsity of graph solutions, and account for uncertainty in the graph structure (according to Ni 2021)
Li, McCormick, and Clark proposed a Bayesian spike-and-slab joint graphical lasso (SSJGL) with an EM algorithm approach which results in a point estimate of the graphs and can scale to high dimensions
They propose Bayesian formulations of the popular frequentist FGL and GGL methods, called DSS-FGL and DSS-GGL respectively.
I became really excited about trying out this method and comparing it to FGL and GGL! While scripts were available, I decided to make an R package based off of them called (spikeyglass) so that myself and others could more easily apply SSJGL to real-world datasets.
So let’s get to an application of Bayesian joint spike-and-slab graphical lasso now! We will essentially be taking the data from each ancestry section simultaneously perform variable selection and shrinkage by using the doubly-spike-and-slab extensions to Danaher’s 2014 GGL and FGL.
Data comes from Hyperglycemia and Adverse Pregnancy Outcome (HAPO) Study. Metabolites were measured from blood samples taken from pregnant women from four ancestry groups: Afro-Caribbean, Mexican American, Northern European, and Thai.
The data consists of ID variable, ancestry group, and metabolites (p = 51 features). The ancestry groups (K = 4) are balanced (n = 400 each, with 1600 total observations).There is also a variable “fpg” which is fasting plasma glucose, which I omit from this analysis.
Within the data there are three groups of metabolites: Amino Acids, Acycl carnitines, and Other.
Here’s what the raw data look like:
# examine ancestry group variable
#summary(as.factor(dat$anc_gp))
head(dat)[,1:10]
## id anc_gp fpg mt1_1 mt1_2 mt1_3 mt1_4 mt1_5 mt1_6
## 1 hm0001 ag3 75.6 218.2223 76.99525 19.06366 14.23091 86.75162 135.2109
## 2 hm0002 ag3 84.6 292.6314 136.41320 43.14854 17.77549 120.17344 213.6531
## 3 hm0003 ag4 79.2 361.1135 79.98370 22.15848 13.05497 74.75441 136.1587
## 4 hm0004 ag3 73.8 274.0428 79.76154 19.72682 12.68744 69.34854 146.7262
## 5 hm0005 ag1 91.8 270.9049 88.98260 39.20949 15.73498 95.56857 145.2876
## 6 hm0006 ag4 73.8 271.4034 99.63992 27.81137 20.91344 95.55926 223.7438
## mt1_7
## 1 64.00578
## 2 91.30156
## 3 83.67878
## 4 72.67280
## 5 48.49431
## 6 68.94172
We can take a look at a few of the metabolites to get an idea of distribution and missingness.
summary(dat$mt1_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 115.6 281.8 322.4 328.5 370.6 660.4
summary(dat$mt2_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -4.631 -3.341 -3.040 -3.044 -2.750 -1.154 1
summary(dat$mt3_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.28 21.39 22.64 22.96 24.64 27.25 11
Assumptions of SSJGL need data to be centered and scaled so we must center and scale each metabolite feature.
Here I am making the data into “long” format and standardizing each metabolite separately. Check out how the data look now:
# prepare data into list of K datasets as needed as input to ssjgl function
# make data into long format then center and scale by metabolite
## we do not want to center and scale within each group because that would remove any signal we see, right?
dat_long <- dat %>%
select(-fpg) %>%
pivot_longer(cols = starts_with("mt"),
names_to = "metabolite",
values_to = "value",
values_drop_na = TRUE) %>%
group_by(metabolite) %>%
mutate(scaled_value = scale(value, center = TRUE, scale = TRUE)) %>%
ungroup()
head(dat_long)
## # A tibble: 6 × 5
## id anc_gp metabolite value scaled_value[,1]
## <chr> <chr> <chr> <dbl> <dbl>
## 1 hm0001 ag3 mt1_1 218. -1.66
## 2 hm0001 ag3 mt1_2 77.0 -0.942
## 3 hm0001 ag3 mt1_3 19.1 -1.30
## 4 hm0001 ag3 mt1_4 14.2 -0.744
## 5 hm0001 ag3 mt1_5 86.8 -0.591
## 6 hm0001 ag3 mt1_6 135. -1.17
Now we can check on how the distribution of a few of the variables look now. Data looks roughly normal. Values of metabolites in class 3 (“Other”) appear to have variable mean between ancestry groups.
# if you want to look at one metabolite
# dat_long %>%
# filter(metabolite %in% c("mt1_1")) %>%
# ggplot(aes(x=value, color=anc_gp)) +
# geom_density()+
# facet_grid(metabolite ~.)
# demonstrative density plot showing centering/scaling of data
dat_long %>%
filter(metabolite %in% c("mt1_1", "mt2_1", "mt3_1")) %>%
ggplot(aes(x=scaled_value, color=anc_gp)) +
geom_density()+
facet_grid(metabolite ~.)
Now the raw data is centered and scaled! But first we need to get it
into K n by p matrices as required by the
spikeyglass package.
So in our case, we need 4 matrices (4 ancestry groups), each with 400 observations of 51 metabolites.
Note: Normally we would impute any missing data. For simplicity in this code I treated missing data as “0” value.
# make list of ancestry groups
ancestry_groups <- sort(unique(dat$anc_gp))
# filter by ancestry group, then make wide matrix of scaled values
# create a list of the results
# Here I'm using the `purrr::map` to iterate over the 4 ancestry groups and make the list. You could use base R lapply instead if you prefer.
dat_mat_list <- purrr::map(ancestry_groups,
~dat_long %>%
filter(anc_gp == .x) %>%
select(-c(anc_gp, value)) %>%
pivot_wider(names_from = metabolite,
values_from = scaled_value,
#for simplicity putting 0 as missing values. Normally should do imputation.
values_fill = 0) %>%
select(-id) %>%
as.matrix())
names(dat_mat_list) <- ancestry_groups
Let’s check to see if our list contains the right structure:
str(dat_mat_list)
## List of 4
## $ ag1: num [1:400, 1:51] -0.869 -0.474 -0.893 1.082 0.578 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:51] "mt1_1" "mt1_2" "mt1_3" "mt1_4" ...
## $ ag2: num [1:400, 1:51] 0.169 0.698 -1.174 -1.012 -0.591 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:51] "mt1_1" "mt1_2" "mt1_3" "mt1_4" ...
## $ ag3: num [1:400, 1:51] -1.664 -0.541 -0.822 0.152 -0.591 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:51] "mt1_1" "mt1_2" "mt1_3" "mt1_4" ...
## $ ag4: num [1:400, 1:51] 0.4919 -0.8616 0.5032 0.0342 -0.157 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:51] "mt1_1" "mt1_2" "mt1_3" "mt1_4" ...
It does! So now we have a list of 4 matrices, each with 400 standardized observations of 51 metabolites.
Before we proceed, let’s visualize the processed data quick in a heatmap to get an idea of if the data has any patterns.
# making one full wide data frame with all 1600 observations, using the centered-scaled values for metabolites that we processed above.
met_wide_all <- dat_long %>%
select(-value) %>%
pivot_wider(names_from = metabolite,
values_from = scaled_value,
#for simplicity putting 0 as missing values. Normally should do imputation.
values_fill = 0) %>%
arrange(anc_gp) #arranging by ancestry group for the heatmap
#make matrix out of dataframe, removing the id and group labels since pheatmap just takes numeric matrix as input
met_wide_all_mat <- t(as.matrix(met_wide_all%>%
select(-c(id, anc_gp))))
met_wide_all_mat[is.na(met_wide_all_mat)] <- 0
colnames(met_wide_all_mat) <- met_wide_all$id
#pheatmap can annotate your heatmap nicely with colors per class! so I'm preparing those here.
#first I want the columns annotated by ancestry group
my_sample_col <- data.frame(ancestry = as.factor(met_wide_all$anc_gp))
rownames(my_sample_col) <- met_wide_all$id
# then I want the rows annotated by metabolite group, we have three groups of metabolites mt1, mt2, mt3
met_group <- str_split_fixed(rownames(met_wide_all_mat), "_", 2)[,1] %>%
str_replace_all(c("mt1" = "Amino Acids",
"mt2" = "Acyl carnitines",
"mt3" = "Other"))
my_sample_row <- data.frame(met_group = met_group)
row.names(my_sample_row) <- row.names(met_wide_all_mat)
# finally we input the numeric matrix and annotations
pheatmap::pheatmap(met_wide_all_mat,
cluster_cols = FALSE,
show_colnames = FALSE,
show_rownames = FALSE,
annotation_col = my_sample_col,
annotation_row = my_sample_row)
Here the x axis has the 1600 samples, sorted by ancestry group. The y axis has the 51 metabolites, clustered using hierarchical clustering, and with a color label based on metabolite group (there are 3 “groups” in the data, with suffix mt1, mt2, and mt3 which indicate Amino Acids, Acyl carnitines, and other groups respectively).
From this heatmap we can see there are big differences in the values for “Other” metabolite group (pink) across the 4 ancestry groups. metabolites from Amino Acids and Acyl Carnitines groups appear to be more homogeneous across the classes.
I created the spikeyglass package using scripts provided
with the paper:
Li ZR, McCormick TH, Clark SJ. Bayesian Joint Spike-and-Slab Graphical Lasso. Proc Mach Learn Res. 2019 Jun;97:3877-3885. PMID: 33521648; PMCID: PMC7845917.
The package allows us to mirror the same analysis as the frequentist
Fused Graphical Lasso method in the JGL package.
library(spikeyglass)
We will first use the spikeyglass package to run doubly
spike-and-slab fused graphical lasso (DSS-FGL) to jointly estimate the
graphs in our 4 ancestry groups.
Per the SSJGL paper, the converged region is insensitive to the choice of lambda1 and lambda2, and the zero and non-zero elements under DSS-FGL tend to be quite stable as the shrinkage reaches a certain point.
So to start off we will set the parameters as follows, and will demonstrate the stability of the results in a later section.
lam1 <- 1
lam2 <- 1
v1 <- 1
lam.eff <- lam1 + c(1:10) * 5
v0s <- lam1/lam.eff
set.seed(1219)
G <- 4
p <- 51
penalty <- "fused"
lam1 <- 1
lam2 <- 1
v1 <- 1
lam.eff <- lam1 + c(1:10) * 5
v0s <- lam1/lam.eff
v0s
## [1] 0.16666667 0.09090909 0.06250000 0.04761905 0.03846154 0.03225806
## [7] 0.02777778 0.02439024 0.02173913 0.01960784
The ssjgl() function takes input similar to the
JGL package, with a few additional parameters as specified
above.
fit1 = ssjgl(Y=dat_mat_list,penalty=penalty,lambda0=1, lambda1=lam1,lambda2=lam2, v1 = v1, v0s = v0s, a=1, b=p, doubly=TRUE, normalize=FALSE)
First, we can plot the solution path and estimated precision matrices of SSJGL, as done in the paper. Essentially we are plotting the estimated partial correlations between all nodes over the iterations of the EM algorithm.
We can see that over the iterations many of the partial correlations shrink down to 0 while the non-zero elements stabilize, which helps us feel confident that the EM algorithm converged.
# Get plot titles
mains <- c(paste0("Class 1"),
paste0("Class 2"),
paste0("Class 3"),
"Class 4")
plot_path(lam1/v0s, fit1, thres = 0, normalize = T, xlab = expression(lambda[1]/v[0]), ylab = "Partial correlations", main = mains, par = c(2, 2),
position = "none")
Now we can plot the estimated partil correlations from our DSS-FGL solution.
We see that for each class, there are 3 big “chunks” within the heatmap, which correspond with the 3 classes of metabolites. The patterns do look largely similar.
par(mfrow = c(2, 2))
for(i in 1:length(dat_mat_list)){
corrplot::corrplot(-cov2cor(fit1$thetalist[[10]][[i]]), method="color", diag=F, is.corr=F)
}
#I am also going to specify colors based on the three metabolite groups and whether or not an edge indicates positive or negative correlation.
# specify colors for vertices (groups of metabolites in our case)
color_opts <- RColorBrewer::brewer.pal(3, "Dark2")
color_list <- data.frame(class = met_group) %>%
mutate(color = case_when(
class == "Amino Acids" ~ color_opts[1],
class == "Acyl carnitines" ~ color_opts[2],
class == "Other" ~ color_opts[3],
TRUE ~ NA_character_
)
)
# color and labels for edges
edge_palette <- brewer.pal(n = 8, name = "RdBu")[c(8,1)]
edge_labels <- c("Positive", "Negative")
Finally we will extract estimated covariance matrices and prepare for
visualizing the graphical model using the igraph
package.
# extract all estimated covariance matrices from result
inv_covar_matrices <- fit1$thetalist[[length(v0s)]]
names(inv_covar_matrices) <- ancestry_groups
#now use `graph_from_adjacency_matrix()` function from igraph to create igraph graphs from adjacency matrices
#again since we have a list covariance matrices I use the map() function to apply igraph::graph_from_adjacency_matrix() over the list
graph_list <- map(ancestry_groups,
~graph_from_adjacency_matrix(
-cov2cor(inv_covar_matrices[[.x]]),
weighted = T,
mode = "undirected",
diag = FALSE
))
names(graph_list) <- ancestry_groups
plot_jgl(graph_list[[1]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 1, DSS-FGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list[[2]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 2, DSS-FGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list[[3]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 3, DSS-FGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list[[4]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 4, DSS-FGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
Now we will run the doubly spike-and-slab group graphical lasso (DS-GGL). All that needs to be changed in the code is to specify the “group” penalty.
penalty = "group"
fit_group = ssjgl(Y=dat_mat_list,penalty=penalty,lambda0=1, lambda1=lam1,lambda2=lam2, v1 = v1, v0s = v0s, a=1, b=p, doubly=TRUE, normalize=FALSE)
As above, we can plot the solution paths to get a sense of if the solution converged.
# Get plot titles
mains <- c(paste0("Class 1"),
paste0("Class 2"),
paste0("Class 3"),
"Class 4")
plot_path(lam1/v0s, fit_group, thres = 0, normalize = T, xlab = expression(lambda[1]/v[0]), ylab = "Partial correlations", main = mains, par = c(2,2),
position = "none")
Now we can plot the estimated partil correlations from our DSS-FGL solution.
We see that for each class, there are 3 big “chunks” within the heatmap, which correspond with the 3 classes of metabolites. The patterns do look largely similar.
par(mfrow = c(2, 2))
for(i in 1:length(dat_mat_list)){
corrplot::corrplot(-cov2cor(fit_group$thetalist[[10]][[i]]), method="color", diag=F, is.corr=F)
}
Extract estimated covariance matrices and prepare for visualization.
# extract all estimated covariance matrices from result
inv_covar_matrices2 <- fit_group$thetalist[[length(v0s)]]
names(inv_covar_matrices2) <- ancestry_groups
graph_list_group <- map(ancestry_groups,
~graph_from_adjacency_matrix(
-cov2cor(inv_covar_matrices2[[.x]]),
weighted = T,
mode = "undirected",
diag = FALSE
))
names(graph_list_group) <- ancestry_groups
plot_jgl(graph_list_group[[1]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 1, DSS-GGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list_group[[2]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 2, DSS-GGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list_group[[3]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 3, DSS-GGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
plot_jgl(graph_list_group[[4]], multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 4, DSS-GGL")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
To demonstrate the effect of various initial parameter choices I ran several additional models with “extreme” parameter choices.
My main takeaway is that the results are consistent and stable no matter the input parameters, with the exception that if \(\lambda_1\) or \(\lambda_2\) are 0 then the solution is completely sparse with no edges.
i = 1
test_graph <- graph_from_adjacency_matrix(-cov2cor(fit_group_smallp$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 1, DSS-GGL, a = b = 1")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
test_graph2 <- graph_from_adjacency_matrix(-cov2cor(fit_group_bigabigb$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph2, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = "Class 1, DSS-GGL, a = b = p")
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
test_graph3 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam0_0$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph3, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = bquote("Class 1, DSS-GGL"~lambda[0]==0))
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
test_graph3 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam0_01$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph3, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = bquote("Class 1, DSS-GGL"~lambda[0]==0.1))
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
test_graph3 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam0_10$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph3, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = bquote("Class 1, DSS-GGL"~lambda[0]==10))
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
No edges
test_graph4 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam1_0$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
# plot_jgl(test_graph4, multiplier = 3,
# vertex.color = color_list$color,
# vertex.label.color = color_list$color,
# main = "Class 1, DSS-GGL, zero lambda_1")
# legend("bottomleft", inset=.02, title="Metabolite class",
# unique(color_list$class), fill=color_opts, cex=0.8)
# legend("topleft", inset=.02, title="Edge",
# edge_labels, fill=edge_palette, cex=0.8)
No edges
# test_graph4 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam1_01$thetalist[[length(v0s)]][[i]]),
# weighted = T,
# mode = "undirected",
# diag = FALSE)
# plot_jgl(test_graph4, multiplier = 3,
# vertex.color = color_list$color,
# vertex.label.color = color_list$color,
# main = "Class 1, DSS-GGL, lambda_1 = 0.1")
# legend("bottomleft", inset=.02, title="Metabolite class",
# unique(color_list$class), fill=color_opts, cex=0.8)
# legend("topleft", inset=.02, title="Edge",
# edge_labels, fill=edge_palette, cex=0.8)
test_graph4 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam1_10$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph4, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = bquote("Class 1, DSS-GGL"~lambda[1]==10))
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
No edges
test_graph4 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam2_0$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
# plot_jgl(test_graph4, multiplier = 3,
# vertex.color = color_list$color,
# vertex.label.color = color_list$color,
# main = "Class 1, DSS-GGL, zero lambda_1")
# legend("bottomleft", inset=.02, title="Metabolite class",
# unique(color_list$class), fill=color_opts, cex=0.8)
# legend("topleft", inset=.02, title="Edge",
# edge_labels, fill=edge_palette, cex=0.8)
test_graph4 <- graph_from_adjacency_matrix(-cov2cor(fit_group_lam2_01$thetalist[[length(v0s)]][[i]]),
weighted = T,
mode = "undirected",
diag = FALSE)
plot_jgl(test_graph4, multiplier = 3,
vertex.color = color_list$color,
vertex.label.color = color_list$color,
main = bquote("Class 1, DSS-GGL"~lambda[2]==0.1))
legend("bottomleft", inset=.02, title="Metabolite class",
unique(color_list$class), fill=color_opts, cex=0.8)
legend("topleft", inset=.02, title="Edge",
edge_labels, fill=edge_palette, cex=0.8)
Check out my tutorial on the frequentist method “Joint Graphical Lasso” (JGL) using the same metabolite data in this tutorial to compare the analysis results.
A few observations I have:
While SSJGL can take some time to run depending on your data size and parameters, I found it to be much faster than having to do cross-validation to tune the penalty terms using JGL. SSJGL does not require cross-validation or tuning, because the results are quite stable to the choice of input parameters.
I also found the final result of SSJGL to be much more sparse compared to JGL. For example in this metabolite dataset, using DSS-GGL to jointly estimate the graph structure, Class 1 had 138 edges. This contrasts to the final tuned GGL graph, which had 240 edges.
I also found the results of SSJGL to be stable no matter what starting parameters I threw at it, with the number of the edges in the solution ranging from 102-148 edges. This contrasts to when I threw all kinds of lambdas into JGL and the solution for this same dataset ranged from 44 edges (with large penalty parameter lambdas) to 692 edges (with small lambdas).
Danaher P, Wang P and Witten D (2011). The joint graphical lasso for inverse covariance estimation across multiple classes. http://arxiv.org/abs/1111.0324
Li ZR, McCormick TH, Clark SJ. Bayesian Joint Spike-and-Slab Graphical Lasso. Proc Mach Learn Res. 2019 Jun;97:3877-3885. PMID: 33521648; PMCID: PMC7845917.
Ni Y, Baladandayuthapani V, Vannucci M, Stingo FC. Bayesian graphical models for modern biological applications. Stat Methods Appt. 2022;31(2):197-225. doi:10.1007/s10260-021-00572-8
Raji Balasubramanian, Denise Scholtens (2021). ENAR Conference Presentation. (Metabolomics dataset and description)
igraph: for plotting network
graphs
JGL: for
running Joint Graphical Lasso
tidyverse: for data wrangling
RColorBrewer: for pretty color palettes
pheatmap: My favorite heatmap package, makes the heatmaps with annotated tracks (See this blog post by Dave Tang for a nice tutorial on using this package.)
I built the tutorial site using GitHub pages and RMarkdown. See here for Jenny Bryant’s guide to using Git and Rmarkdown.
Thank you to Dr. Richard Li for providing SSJGL scripts and encouragement for R package creation!
Thank you to Dr. Raji Balasubramanian for the metabolomics data, which was previously presented at ENAR 2021 with Dr. Denise Scholtens.
Thank you to Dr. Kate Hoff Shutta and Dr. Yukun Li who provided code for the beautiful circular network graphs.